LA Crime¶
Author: Dhanush Vasa
Table of Contents:¶
- Introduction
- Data Collection
- Data Cleaning and Exploratory Analysis
- Modeling
- Interpretation of Results
- Conclusion
1. Introduction¶
The aim of this tutorial is to guide you through the data science lifecycle, providing an introduction to various key concepts in data science. The stages of the data science lifecycle are:
- Data Collection
- Data Cleaning
- Exploratory Analysis and Visualization
- Modeling
- Results Interpretation
Crime in Los Angeles is a complex and dynamic issue, shaped by the city's size, diversity, and socioeconomic circumstances. As one of the largest metropolitan areas in the United States, Los Angeles sees a wide range of criminal activity. Understanding crime patterns and trends in Los Angeles is crucial for keeping the public safe and for ensuring that law enforcement agencies allocate resources efficiently.
Many factors influence criminal activity in Los Angeles, including local demographics, economic conditions, and the physical environment. Some parts of the city are recurring hotspots for particular types of crime, often associated with population density, accessibility, or the presence of specific establishments. For example, theft may be more prevalent in commercial districts, whereas violent crimes may cluster in economically deprived areas. Temporal trends matter as well: certain crimes tend to increase at particular times of year, days of the week, or even hours of the day.
By examining Los Angeles crime statistically, policymakers and law enforcement can uncover critical patterns and build focused prevention and intervention initiatives. For example, assessing geographic trends can help assign police patrols more effectively to high-crime areas, while investigating temporal trends can inform resource deployment during peak hours. Furthermore, understanding the underlying causes of crime, whether social, economic, or environmental, can help drive community-based programs to reduce criminal behavior. A comprehensive approach to studying crime in Los Angeles is critical for developing safer communities and building trust between residents and law enforcement.
Important Note:
- In certain sections of the code, you may encounter warning messages. These can be safely ignored while focusing on the intended output of the code.
2. Data Collection¶
To begin any analysis, we must collect data relevant to the question we want to answer. The quality of a machine learning model is directly proportional to the quality of the data it processes: a solid dataset ensures that the model finds relevant patterns and draws sound conclusions. As a result, selecting an appropriate dataset is an important stage in the data science lifecycle.
In this tutorial, we will use the Crime Data from 2020 to Present dataset for Los Angeles, available on OpenML, which provides a detailed account of reported criminal incidents beginning in 2020. This dataset contains key attributes such as unique report numbers, dates and times of reporting and occurrence, crime descriptions with associated codes, and specific geographic information such as area names, premises descriptions, and exact latitude/longitude coordinates. It also provides demographic information about victims, weapons used, and the status of each crime report.
For public safety agencies, analysts, and researchers, this dataset is valuable because it makes it easier to identify patterns in crime, analyze hotspots, and assess the efficacy of law enforcement. Using this data, we can investigate a range of use cases, including building predictive models, understanding the social factors that influence crime, and supporting decision-makers. Spatial data, for instance, can be used to identify high-crime areas and guide resource allocation, while knowledge of victim demographic trends can help guide community safety efforts. Because of its extensive scope, this dataset is well suited for methodically studying and tackling crime.
Importing Python Libraries¶
As shown below, we must import the necessary Python libraries before beginning. These libraries will be used throughout the tutorial. The code is written for execution in Jupyter Notebook, a popular tool among data scientists because it makes data visualization and analysis easier, so we recommend running it there. As we move forward, we will explain the role of each library where it is used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
import geopandas as gpd
import folium
from shapely.geometry import Point
from folium.plugins import HeatMap
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (classification_report, accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score, roc_auc_score)
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Filter out the warnings so the intended output stays readable
import warnings
warnings.filterwarnings("ignore")
Download and Import the Data¶
Head to the LA Crime dataset page and download the dataset file. Extract the data from the download and keep the dataset file in the same folder as the notebook, since the code below assumes that layout; this also makes the analysis easy to replicate if required.
# Read the uploaded file to determine its format
file_path = 'dataset_'

# Read the first few bytes of the file to inspect its structure
with open(file_path, 'rb') as file:
    file_head = file.read(512)  # Read the first 512 bytes
file_head.decode(errors='replace')
'% Description:\n% This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, p'
This cell begins the exploration by inspecting the first 512 bytes of the dataset file in binary mode, which reveals descriptive metadata about the criminal incident records: report numbers, dates, times, crime descriptions, locations, and victim/suspect details.
# Display the first few lines of the file to identify delimiters or formatting issues
with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
    for i in range(10):  # Display the first 10 lines
        print(file.readline())
This snippet reads the file as UTF-8 text and prints its first ten lines, making it easier to spot delimiters, comment markers, and other formatting details before parsing.
% Description:
% This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, public safety organizations, and researchers to understand crime patterns, allocate resources effectively, and develop crime prevention strategies.
%
% Attribute Description:
% - DR_NO: A unique identifier for the crime report.
% - Date Rptd & DATE OCC: The dates when the crime was reported and occurred.
% - TIME OCC: The time when the crime occurred.
% - AREA & AREA NAME: Numeric and textual descriptions of the area where the crime occurred.
% - Rpt Dist No: The reporting district number.
% - Part 1-2: Indicates whether the crime is a Part 1 (more severe) or Part 2 offense.
Reads and prints the first 10 lines of Crime_Data_from_2020_to_Present.csv in UTF-8 encoding to inspect its structure, delimiters, and metadata, revealing attributes like DR_NO, report dates, times, area details, district numbers, and offense classification (Part 1 or Part 2) for further analysis.
# Attempt to locate the line where the actual dataset starts
with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
    lines = file.readlines()

# Display lines to find the starting point of the dataset
for i, line in enumerate(lines[:50]):  # Check the first 50 lines
    print(f"Line {i + 1}: {line.strip()}")
Line 1: % Description:
Line 2: % This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, public safety organizations, and researchers to understand crime patterns, allocate resources effectively, and develop crime prevention strategies.
Line 3: %
Line 4: % Attribute Description:
Line 5: % - DR_NO: A unique identifier for the crime report.
Line 6: % - Date Rptd & DATE OCC: The dates when the crime was reported and occurred.
Line 7: % - TIME OCC: The time when the crime occurred.
Line 8: % - AREA & AREA NAME: Numeric and textual descriptions of the area where the crime occurred.
Line 9: % - Rpt Dist No: The reporting district number.
Line 10: % - Part 1-2: Indicates whether the crime is a Part 1 (more severe) or Part 2 offense.
Line 11: % - Crm Cd & Crm Cd Desc: The crime code and its description.
Line 12: % - Mocodes: Modus operandi codes related to the crime.
Line 13: % - Vict Age, Vict Sex, Vict Descent: Age, sex, and ethnic descent of the victim.
Line 14: % - Premis Cd & Premis Desc: Codes and descriptions of the premises where the crime occurred.
Line 15: % - Weapon Used Cd & Weapon Desc: Codes and descriptions of any weapons used.
Line 16: % - Status & Status Desc: The status of the crime report and its description (e.g., Invest Cont, Adult Arrest).
Line 17: % - Crm Cd 1-4: Additional crime codes related to the incident.
Line 18: % - LOCATION & Cross Street: The specific location and, if applicable, cross street of the crime.
Line 19: % - LAT & LON: Latitude and longitude of the crime location.
Line 20: %
Line 21: % Use Case:
Line 22: % This dataset is crucial for public safety analyses, allowing for the tracking of crime trends, hotspot identification, and the assessment of law enforcement effectiveness. It can also be utilized by policymakers for strategic planning and by academic researchers studying the sociology of crime or developing predictive models. Community groups may use this data to advocate for safety and support initiatives in their neighborhoods.
Line 23: @RELATION Crime_Data_from_2020_to_present_in_Los_Angeles
Line 24:
Line 25: @ATTRIBUTE DR_NO INTEGER
Line 26: @ATTRIBUTE "Date Rptd" STRING
Line 27: @ATTRIBUTE "DATE OCC" STRING
Line 28: @ATTRIBUTE "TIME OCC" INTEGER
Line 29: @ATTRIBUTE AREA INTEGER
Line 30: @ATTRIBUTE "AREA NAME" STRING
Line 31: @ATTRIBUTE "Rpt Dist No" INTEGER
Line 32: @ATTRIBUTE "Part 1-2" INTEGER
Line 33: @ATTRIBUTE "Crm Cd" INTEGER
Line 34: @ATTRIBUTE "Crm Cd Desc" STRING
Line 35: @ATTRIBUTE Mocodes STRING
Line 36: @ATTRIBUTE "Vict Age" INTEGER
Line 37: @ATTRIBUTE "Vict Sex" STRING
Line 38: @ATTRIBUTE "Vict Descent" STRING
Line 39: @ATTRIBUTE "Premis Cd" REAL
Line 40: @ATTRIBUTE "Premis Desc" STRING
Line 41: @ATTRIBUTE "Weapon Used Cd" REAL
Line 42: @ATTRIBUTE "Weapon Desc" STRING
Line 43: @ATTRIBUTE Status STRING
Line 44: @ATTRIBUTE "Status Desc" STRING
Line 45: @ATTRIBUTE "Crm Cd 1" REAL
Line 46: @ATTRIBUTE "Crm Cd 2" REAL
Line 47: @ATTRIBUTE "Crm Cd 3" REAL
Line 48: @ATTRIBUTE "Crm Cd 4" REAL
Line 49: @ATTRIBUTE LOCATION STRING
Line 50: @ATTRIBUTE "Cross Street" STRING
This process identifies where the actual dataset starts in a file containing descriptive metadata by reading the first 50 lines, revealing attribute descriptions in the @ATTRIBUTE format (typical of ARFF files), ensuring accurate data parsing for further analysis.
# Locate the starting point of the actual data
data_start = None
for i, line in enumerate(lines):
    if "@DATA" in line.upper():  # ARFF files typically use '@DATA' to mark the start of data
        data_start = i + 1  # Data starts after this line
        break

# Display a few lines of the actual data if found
if data_start:
    print(f"Data starts at line {data_start + 1}.")
    for line in lines[data_start:data_start + 10]:
        print(line.strip())
else:
    print("No '@DATA' section found; the structure might differ.")
Data starts at line 55.
190326475,'03/01/2020 12:00:00 AM','03/01/2020 12:00:00 AM',2130,7,Wilshire,784,1,510,'VEHICLE - STOLEN',?,0,M,O,101.0,STREET,?,?,AA,'Adult Arrest',510.0,998.0,?,?,'1900 S LONGWOOD AV',?,34.0375,-118.3506
200106753,'02/09/2020 12:00:00 AM','02/08/2020 12:00:00 AM',1800,1,Central,182,1,330,'BURGLARY FROM VEHICLE','1822 1402 0344',47,M,O,128.0,'BUS STOP/LAYOVER (ALSO QUERY 124)',?,?,IC,'Invest Cont',330.0,998.0,?,?,'1000 S FLOWER ST',?,34.0444,-118.2628
200320258,'11/11/2020 12:00:00 AM','11/04/2020 12:00:00 AM',1700,3,Southwest,356,1,480,'BIKE - STOLEN','0344 1251',19,X,X,502.0,'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)',?,?,IC,'Invest Cont',480.0,?,?,?,'1400 W 37TH ST',?,34.021,-118.3002
200907217,'05/10/2023 12:00:00 AM','03/10/2020 12:00:00 AM',2037,9,'Van Nuys',964,1,343,'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)','0325 1501',19,M,O,405.0,'CLOTHING STORE',?,?,IC,'Invest Cont',343.0,?,?,?,'14000 RIVERSIDE DR',?,34.1576,-118.4387
220614831,'08/18/2022 12:00:00 AM','08/17/2020 12:00:00 AM',1200,6,Hollywood,666,2,354,'THEFT OF IDENTITY','1822 1501 0930 2004',28,M,H,102.0,SIDEWALK,?,?,IC,'Invest Cont',354.0,?,?,?,'1900 TRANSIENT',?,34.0944,-118.3277
231808869,'04/04/2023 12:00:00 AM','12/01/2020 12:00:00 AM',2300,18,Southeast,1826,2,354,'THEFT OF IDENTITY','1822 0100 0930 0929',41,M,H,501.0,'SINGLE FAMILY DWELLING',?,?,IC,'Invest Cont',354.0,?,?,?,'9900 COMPTON AV',?,33.9467,-118.2463
230110144,'04/04/2023 12:00:00 AM','07/03/2020 12:00:00 AM',900,1,Central,182,2,354,'THEFT OF IDENTITY','0930 0929',25,M,H,502.0,'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)',?,?,IC,'Invest Cont',354.0,?,?,?,'1100 S GRAND AV',?,34.0415,-118.262
220314085,'07/22/2022 12:00:00 AM','05/12/2020 12:00:00 AM',1110,3,Southwest,303,2,354,'THEFT OF IDENTITY',0100,27,F,B,248.0,'CELL PHONE STORE',?,?,IC,'Invest Cont',354.0,?,?,?,'2500 S SYCAMORE AV',?,34.0335,-118.3537
231309864,'04/28/2023 12:00:00 AM','12/09/2020 12:00:00 AM',1400,13,Newton,1375,2,354,'THEFT OF IDENTITY',0100,24,F,B,750.0,CYBERSPACE,?,?,IC,'Invest Cont',354.0,?,?,?,'1300 E 57TH ST',?,33.9911,-118.2521
211904005,'12/31/2020 12:00:00 AM','12/31/2020 12:00:00 AM',1220,19,Mission,1974,2,624,'BATTERY - SIMPLE ASSAULT',0416,26,M,H,502.0,'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)',400.0,'STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)',IC,'Invest Cont',624.0,?,?,?,'9000 CEDROS AV',?,34.2336,-118.4535
This process locates the starting point of the dataset in an ARFF file by identifying the @DATA marker, confirming that data begins at line 55, and displaying initial rows of comma-separated crime records, ensuring accurate parsing for further analysis.
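As an aside, the @ATTRIBUTE/@DATA markers identify this as an ARFF file, and SciPy ships a dedicated reader that handles the '%' comment lines and '?' missing-value markers automatically. The snippet below is a minimal sketch on a toy ARFF document (the field names and values here are illustrative, not the full LA schema); we will nevertheless continue with the manual pandas approach, which gives us finer control over parsing.

```python
import io
from scipy.io import arff

# A tiny ARFF document mimicking the structure found above:
# '%' comments, @ATTRIBUTE declarations, and '?' for missing values.
sample = """% toy example
@RELATION toy_crimes
@ATTRIBUTE DR_NO NUMERIC
@ATTRIBUTE area {Wilshire,Central}
@ATTRIBUTE LAT NUMERIC
@DATA
190326475,Wilshire,34.0375
200106753,Central,?
"""

data, meta = arff.loadarff(io.StringIO(sample))
print(meta.names())   # ['DR_NO', 'area', 'LAT']
print(data['LAT'])    # '?' becomes nan for numeric fields
```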
# Re-import necessary libraries
import pandas as pd

# Re-attempt to process the dataset
try:
    # Reload and inspect the first few lines of the file
    with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
        for i in range(10):  # Display the first 10 lines
            print(file.readline().strip())

    # Load the dataset by skipping metadata and identifying the start of the actual data
    df = pd.read_csv(file_path, skiprows=55, delimiter=',', on_bad_lines='skip', header=None)
    print("Dataset loaded successfully!")
except Exception as e:
    print(f"Error occurred while processing the dataset: {e}")
% Description:
% This dataset, named Crime_Data_from_2020_to_Present.csv, provides a detailed record of reported criminal incidents in a given area from the year 2020 onwards. It includes comprehensive information per incident, such as report numbers, reporting and occurrence dates and times, crime descriptions with specific codes, the locations (including area names and numbers, premises, LAT/LON coordinates), and details about the victims and suspects involved. This dataset is instrumental for analysts, public safety organizations, and researchers to understand crime patterns, allocate resources effectively, and develop crime prevention strategies.
%
% Attribute Description:
% - DR_NO: A unique identifier for the crime report.
% - Date Rptd & DATE OCC: The dates when the crime was reported and occurred.
% - TIME OCC: The time when the crime occurred.
% - AREA & AREA NAME: Numeric and textual descriptions of the area where the crime occurred.
% - Rpt Dist No: The reporting district number.
% - Part 1-2: Indicates whether the crime is a Part 1 (more severe) or Part 2 offense.
Dataset loaded successfully!
This cell reloads the dataset with pandas, skipping the metadata header (the first 55 lines) and parsing the remainder as comma-separated values, with on_bad_lines='skip' to discard malformed rows and header=None because the data section carries no header row, yielding a DataFrame ready for analysis.
df.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200106753 | '02/09/2020 12:00:00 AM' | '02/08/2020 12:00:00 AM' | 1800 | 1 | Central | 182 | 1 | 330 | 'BURGLARY FROM VEHICLE' | ... | IC | 'Invest Cont' | 330.0 | 998.0 | ? | ? | '1000 S FLOWER ST' | ? | 34.0444 | -118.2628 |
| 1 | 200907217 | '05/10/2023 12:00:00 AM' | '03/10/2020 12:00:00 AM' | 2037 | 9 | 'Van Nuys' | 964 | 1 | 343 | 'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)' | ... | IC | 'Invest Cont' | 343.0 | ? | ? | ? | '14000 RIVERSIDE DR' | ? | 34.1576 | -118.4387 |
| 2 | 220614831 | '08/18/2022 12:00:00 AM' | '08/17/2020 12:00:00 AM' | 1200 | 6 | Hollywood | 666 | 2 | 354 | 'THEFT OF IDENTITY' | ... | IC | 'Invest Cont' | 354.0 | ? | ? | ? | '1900 TRANSIENT' | ? | 34.0944 | -118.3277 |
| 3 | 231808869 | '04/04/2023 12:00:00 AM' | '12/01/2020 12:00:00 AM' | 2300 | 18 | Southeast | 1826 | 2 | 354 | 'THEFT OF IDENTITY' | ... | IC | 'Invest Cont' | 354.0 | ? | ? | ? | '9900 COMPTON AV' | ? | 33.9467 | -118.2463 |
| 4 | 220314085 | '07/22/2022 12:00:00 AM' | '05/12/2020 12:00:00 AM' | 1110 | 3 | Southwest | 303 | 2 | 354 | 'THEFT OF IDENTITY' | ... | IC | 'Invest Cont' | 354.0 | ? | ? | ? | '2500 S SYCAMORE AV' | ? | 34.0335 | -118.3537 |
5 rows × 28 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533557 entries, 0 to 533556
Data columns (total 28 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   0       533557 non-null  int64
 1   1       533557 non-null  object
 2   2       533557 non-null  object
 3   3       533557 non-null  int64
 4   4       533557 non-null  int64
 5   5       533557 non-null  object
 6   6       533557 non-null  int64
 7   7       533557 non-null  int64
 8   8       533557 non-null  int64
 9   9       533557 non-null  object
 10  10      533557 non-null  object
 11  11      533557 non-null  object
 12  12      533557 non-null  object
 13  13      533557 non-null  object
 14  14      533557 non-null  object
 15  15      533557 non-null  object
 16  16      533557 non-null  object
 17  17      533557 non-null  object
 18  18      533557 non-null  object
 19  19      533557 non-null  object
 20  20      533557 non-null  object
 21  21      533557 non-null  object
 22  22      533557 non-null  object
 23  23      533557 non-null  object
 24  24      533557 non-null  object
 25  25      533557 non-null  object
 26  26      533557 non-null  object
 27  27      533557 non-null  object
dtypes: int64(6), object(22)
memory usage: 114.0+ MB
3. Data Cleaning and Exploratory Analysis¶
Data cleaning is the essential process of preparing a dataset for analysis or machine learning by ensuring it is consistent, complete, and accurate. This process involves several key tasks, such as removing unnecessary or irrelevant data, filling in missing values, and standardizing metrics or measurements to create uniformity. Additionally, new features can be derived from existing data to make the dataset more useful and meaningful. By addressing errors and inconsistencies, data cleaning ensures the dataset is reliable and forms a solid foundation for further analysis or model training.
Often combined with data cleaning, exploratory analysis involves examining the dataset to uncover patterns, trends, and relationships that provide valuable insights. This step includes creating visualizations, such as graphs or plots, to identify correlations or significant variables, and spotting potential issues like outliers that may need cleaning. Insights gained during this process may guide the creation of new features or adjustments to existing ones, refining the dataset for better performance in a machine learning model. By integrating these two steps, we not only ensure the dataset is clean but also well-understood, which is critical for building effective models or conducting insightful analysis.
Key Steps in Data Cleaning:¶
- Remove unnecessary or irrelevant data.
- Fill in missing values to address gaps.
- Standardize metrics or measurements.
- Create new features from existing data to enhance usability.
- Ensure data accuracy and reliability.
Key Steps in Exploratory Analysis:¶
- Visualize data through graphs and plots to uncover patterns and relationships.
- Identify significant features or correlations.
- Detect issues like outliers or irregularities for further cleaning.
- Refine research questions based on insights from the data.
- Create or adjust features to align with identified trends or insights.
- Combine exploratory analysis with cleaning for a comprehensive understanding of the dataset.
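Before applying these steps to the crime data, here is a minimal sketch of the cleaning pattern on a hypothetical toy frame (the column names echo the dataset, but the rows are made up): filling gaps, standardizing types so the dataset's '?' placeholders become proper missing values, parsing dates, and deriving a new feature.

```python
import pandas as pd

# Hypothetical rows mimicking the raw crime data
toy = pd.DataFrame({
    "Date_Occ": ["03/01/2020 12:00:00 AM", "02/08/2020 12:00:00 AM"],
    "Vict_Sex": ["M", None],
    "Vict_Age": ["47", "?"],
})

toy["Vict_Sex"] = toy["Vict_Sex"].fillna("None")                   # fill gaps
toy["Vict_Age"] = pd.to_numeric(toy["Vict_Age"], errors="coerce")  # standardize: '?' -> NaN
toy["Date_Occ"] = pd.to_datetime(toy["Date_Occ"].str[:10], format="%m/%d/%Y")
toy["Day_Of_Week"] = toy["Date_Occ"].dt.day_name()                 # derive a new feature

print(toy)
```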
df.columns = [
"DR_NO", "Date_Rptd", "Date_Occ", "Time_Occ", "Area", "Area_Name",
"Rpt_Dist_No", "Part_1_2", "Crm_Cd", "Crm_Cd_Desc", "Mocodes", "Vict_Age",
"Vict_Sex", "Vict_Descent", "Premis_Cd", "Premis_Desc", "Weapon_Used_Cd",
"Weapon_Desc", "Status", "Status_Desc", "Crm_Cd_1", "Crm_Cd_2", "Crm_Cd_3",
"Crm_Cd_4", "Location", "Cross_Street", "Lat", "Lon"
]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533557 entries, 0 to 533556
Data columns (total 28 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   DR_NO           533557 non-null  int64
 1   Date_Rptd       533557 non-null  object
 2   Date_Occ        533557 non-null  object
 3   Time_Occ        533557 non-null  int64
 4   Area            533557 non-null  int64
 5   Area_Name       533557 non-null  object
 6   Rpt_Dist_No     533557 non-null  int64
 7   Part_1_2        533557 non-null  int64
 8   Crm_Cd          533557 non-null  int64
 9   Crm_Cd_Desc     533557 non-null  object
 10  Mocodes         533557 non-null  object
 11  Vict_Age        533557 non-null  object
 12  Vict_Sex        533557 non-null  object
 13  Vict_Descent    533557 non-null  object
 14  Premis_Cd       533557 non-null  object
 15  Premis_Desc     533557 non-null  object
 16  Weapon_Used_Cd  533557 non-null  object
 17  Weapon_Desc     533557 non-null  object
 18  Status          533557 non-null  object
 19  Status_Desc     533557 non-null  object
 20  Crm_Cd_1        533557 non-null  object
 21  Crm_Cd_2        533557 non-null  object
 22  Crm_Cd_3        533557 non-null  object
 23  Crm_Cd_4        533557 non-null  object
 24  Location        533557 non-null  object
 25  Cross_Street    533557 non-null  object
 26  Lat             533557 non-null  object
 27  Lon             533557 non-null  object
dtypes: int64(6), object(22)
memory usage: 114.0+ MB
Renames the columns of the DataFrame df to a specified list of column names and displays a summary of the DataFrame structure using df.info().
df['Crm_Cd_Desc'].unique()
array(["'BURGLARY FROM VEHICLE'",
"'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)'",
"'THEFT OF IDENTITY'", "'VEHICLE - STOLEN'",
"'CRIMINAL THREATS - NO WEAPON DISPLAYED'",
"'THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)'",
"'CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER)'",
'BURGLARY', "'THEFT PLAIN - PETTY ($950 & UNDER)'",
"'LEWD CONDUCT'", "'THEFT PLAIN - ATTEMPT'",
"'THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)'",
"'CHILD ANNOYING (17YRS & UNDER)'", "'OTHER MISCELLANEOUS CRIME'",
'ROBBERY', "'UNAUTHORIZED COMPUTER ACCESS'",
"'VIOLATION OF RESTRAINING ORDER'",
"'SHOPLIFTING - PETTY THEFT ($950 & UNDER)'", "'BRANDISH WEAPON'",
"'DOCUMENT FORGERY / STOLEN FELONY'",
"'SEX OFFENDER REGISTRANT OUT OF COMPLIANCE'",
"'VANDALISM - MISDEAMEANOR ($399 OR UNDER)'",
"'CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT'", "'BIKE - STOLEN'",
'EXTORTION', 'PICKPOCKET', 'ARSON', "'DISTURBING THE PEACE'",
"'PEEPING TOM'", "'ORAL COPULATION'", "'VIOLATION OF COURT ORDER'",
"'INTIMATE PARTNER - SIMPLE ASSAULT'", "'FALSE POLICE REPORT'",
"'INTIMATE PARTNER - AGGRAVATED ASSAULT'", 'CONTRIBUTING',
"'FALSE IMPRISONMENT'", "'ATTEMPTED ROBBERY'", "'CHILD STEALING'",
"'INDECENT EXPOSURE'", "'CHILD NEGLECT (SEE 300 W.I.C.)'",
"'DISHONEST EMPLOYEE - GRAND THEFT'", 'TRESPASSING',
"'BATTERY - SIMPLE ASSAULT'", "'CONTEMPT OF COURT'",
"'THREATENING PHONE CALLS/LETTERS'", 'PIMPING',
"'VEHICLE - ATTEMPT STOLEN'", 'PANDERING',
"'LEWD/LASCIVIOUS ACTS WITH CHILD'",
"'HUMAN TRAFFICKING - COMMERCIAL SEX ACTS'",
"'FIREARMS RESTRAINING ORDER (FIREARMS RO)'",
"'DISCHARGE FIREARMS/SHOTS FIRED'", "'FAILURE TO YIELD'",
"'BOMB SCARE'", "'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER'",
"'OTHER ASSAULT'", "'BATTERY POLICE (SIMPLE)'",
"'THEFT FROM PERSON - ATTEMPT'",
"'SHOTS FIRED AT INHABITED DWELLING'",
"'CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT'",
"'TILL TAP - GRAND THEFT ($950.01 & OVER)'",
"'VIOLATION OF TEMPORARY RESTRAINING ORDER'", "'RESISTING ARREST'",
"'THROWING OBJECT AT MOVING VEHICLE'",
"'DOCUMENT WORTHLESS ($200.01 & OVER)'",
"'SEXUAL PENETRATION W/FOREIGN OBJECT'", 'KIDNAPPING',
"'CRIMINAL HOMICIDE'", "'PURSE SNATCHING'",
"'THEFT FROM MOTOR VEHICLE - ATTEMPT'",
"'SODOMY/SEXUAL CONTACT B/W PENIS OF ONE PERS TO ANUS OTH'",
"'DRIVING WITHOUT OWNER CONSENT (DWOC)'", "'RECKLESS DRIVING'",
'STALKING', "'SHOPLIFTING - ATTEMPT'", "'CHILD PORNOGRAPHY'",
"'BATTERY WITH SEXUAL CONTACT'", 'COUNTERFEIT',
"'CRUELTY TO ANIMALS'", "'BOAT - STOLEN'", "'ILLEGAL DUMPING'",
'PROWLER', "'DOCUMENT WORTHLESS ($200 & UNDER)'",
"'BATTERY ON A FIREFIGHTER'", "'PETTY THEFT - AUTO REPAIR'",
"'TILL TAP - PETTY ($950 & UNDER)'",
"'KIDNAPPING - GRAND ATTEMPT'",
"'DISHONEST EMPLOYEE - PETTY THEFT'",
"'HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE'",
"'WEAPONS POSSESSION/BOMBING'", "'BIKE - ATTEMPTED STOLEN'",
"'GRAND THEFT / AUTO REPAIR'", 'CONSPIRACY', 'BRIBERY',
"'PURSE SNATCHING - ATTEMPT'", "'GRAND THEFT / INSURANCE FRAUD'",
"'DRUNK ROLL'", "'CHILD ABANDONMENT'", "'DISRUPT SCHOOL'",
"'FAILURE TO DISPERSE'",
"'FIREARMS EMERGENCY PROTECTIVE ORDER (FIREARMS EPO)'", 'BIGAMY',
"'VANDALISM - FELONY ($400 & OVER", "'ASSAULT WITH DEADLY WEAPON",
"'BURGLARY", "'CREDIT CARDS", "'EMBEZZLEMENT", "'BUNCO", "'THEFT",
"'BURGLARY FROM VEHICLE", "'RAPE",
"'SHOTS FIRED AT MOVING VEHICLE",
"'DEFRAUDING INNKEEPER/THEFT OF SERVICES", "'BEASTIALITY",
"'INCEST (SEXUAL ACTS BETWEEN BLOOD RELATIVES)'", "'DRUGS",
"'TELEPHONE PROPERTY - DAMAGE'", "'INCITING A RIOT'",
"'DISHONEST EMPLOYEE ATTEMPTED THEFT'",
"'BLOCKING DOOR INDUCTION CENTER'", "'LYNCHING - ATTEMPTED'",
'LYNCHING', "'TRAIN WRECKING'", "'LETTERS", "'SEX"], dtype=object)
Retrieves and displays all the unique values in the Crm_Cd_Desc column of the DataFrame df, which represent the unique descriptions of crime categories in the dataset.
df.drop(columns=['Weapon_Used_Cd', 'Weapon_Desc', 'Crm_Cd_1', 'Crm_Cd_2', 'Crm_Cd_3', 'Crm_Cd_4', 'Cross_Street'], inplace=True)
df['Vict_Descent'] = df['Vict_Descent'].fillna('None')
df['Vict_Sex'] = df['Vict_Sex'].fillna('None')
df['Mocodes'] = df['Mocodes'].fillna('none')
df['Premis_Desc'] = df['Premis_Desc'].fillna('None')
df['Date_Rptd'] = pd.to_datetime(df['Date_Rptd'].str[:11])
df['Date_Occ'] = pd.to_datetime(df['Date_Occ'].str[:11])
df.head()
| DR_NO | Date_Rptd | Date_Occ | Time_Occ | Area | Area_Name | Rpt_Dist_No | Part_1_2 | Crm_Cd | Crm_Cd_Desc | ... | Vict_Age | Vict_Sex | Vict_Descent | Premis_Cd | Premis_Desc | Status | Status_Desc | Location | Lat | Lon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200106753 | 2020-02-09 | 2020-02-08 | 1800 | 1 | Central | 182 | 1 | 330 | 'BURGLARY FROM VEHICLE' | ... | 47 | M | O | 128.0 | 'BUS STOP/LAYOVER (ALSO QUERY 124)' | IC | 'Invest Cont' | '1000 S FLOWER ST' | 34.0444 | -118.2628 |
| 1 | 200907217 | 2023-05-10 | 2020-03-10 | 2037 | 9 | 'Van Nuys' | 964 | 1 | 343 | 'SHOPLIFTING-GRAND THEFT ($950.01 & OVER)' | ... | 19 | M | O | 405.0 | 'CLOTHING STORE' | IC | 'Invest Cont' | '14000 RIVERSIDE DR' | 34.1576 | -118.4387 |
| 2 | 220614831 | 2022-08-18 | 2020-08-17 | 1200 | 6 | Hollywood | 666 | 2 | 354 | 'THEFT OF IDENTITY' | ... | 28 | M | H | 102.0 | SIDEWALK | IC | 'Invest Cont' | '1900 TRANSIENT' | 34.0944 | -118.3277 |
| 3 | 231808869 | 2023-04-04 | 2020-12-01 | 2300 | 18 | Southeast | 1826 | 2 | 354 | 'THEFT OF IDENTITY' | ... | 41 | M | H | 501.0 | 'SINGLE FAMILY DWELLING' | IC | 'Invest Cont' | '9900 COMPTON AV' | 33.9467 | -118.2463 |
| 4 | 220314085 | 2022-07-22 | 2020-05-12 | 1110 | 3 | Southwest | 303 | 2 | 354 | 'THEFT OF IDENTITY' | ... | 27 | F | B | 248.0 | 'CELL PHONE STORE' | IC | 'Invest Cont' | '2500 S SYCAMORE AV' | 34.0335 | -118.3537 |
5 rows × 21 columns
Cleans the DataFrame by dropping unnecessary columns, filling missing values with 'None', converting the date columns to datetime, and previewing the cleaned data.
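Note that the ARFF quoting leaves stray apostrophes in the string values, so a slightly more defensive version of the date conversion strips them first and parses with an explicit format. A sketch on made-up values:

```python
import pandas as pd

# Raw values as they appear after loading the ARFF body as CSV (illustrative)
raw = pd.Series(["'02/09/2020 12:00:00 AM'", "'03/10/2020 12:00:00 AM'"])

# Strip the stray quotes, keep only the date portion, and parse with an explicit format
dates = pd.to_datetime(raw.str.strip("'").str[:10], format="%m/%d/%Y")
print(dates.dt.month.tolist())  # [2, 3]
```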
df.isnull().sum()
DR_NO           0
Date_Rptd       0
Date_Occ        0
Time_Occ        0
Area            0
Area_Name       0
Rpt_Dist_No     0
Part_1_2        0
Crm_Cd          0
Crm_Cd_Desc     0
Mocodes         0
Vict_Age        0
Vict_Sex        0
Vict_Descent    0
Premis_Cd       0
Premis_Desc     0
Status          0
Status_Desc     0
Location        0
Lat             0
Lon             0
dtype: int64
Checks for missing values in the DataFrame df by using df.isnull().sum(). It outputs the total count of missing values for each column. The result shows that all columns have 0 missing values, indicating the dataset has been successfully cleaned of any null or missing data.
df_cleaned = df.dropna().copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
df_cleaned['Vict_Age'] = pd.to_numeric(df_cleaned['Vict_Age'], errors='coerce').astype('Int64')
df_cleaned['Lat'] = pd.to_numeric(df_cleaned['Lat'], errors='coerce')
df_cleaned['Lon'] = pd.to_numeric(df_cleaned['Lon'], errors='coerce')
df_cleaned['Vict_Sex'] = df_cleaned['Vict_Sex'].astype('category')
df_cleaned['Vict_Descent'] = df_cleaned['Vict_Descent'].astype('category')
print(df_cleaned.isnull().sum())
print(df_cleaned.info())
DR_NO               0
Date_Rptd           0
Date_Occ            0
Time_Occ            0
Area                0
Area_Name           0
Rpt_Dist_No         0
Part_1_2            0
Crm_Cd              0
Crm_Cd_Desc         0
Mocodes             0
Vict_Age         9550
Vict_Sex            0
Vict_Descent        0
Premis_Cd           0
Premis_Desc         0
Status              0
Status_Desc         0
Location            0
Lat             14420
Lon              2185
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533557 entries, 0 to 533556
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   DR_NO         533557 non-null  int64
 1   Date_Rptd     533557 non-null  datetime64[ns]
 2   Date_Occ      533557 non-null  datetime64[ns]
 3   Time_Occ      533557 non-null  int64
 4   Area          533557 non-null  int64
 5   Area_Name     533557 non-null  object
 6   Rpt_Dist_No   533557 non-null  int64
 7   Part_1_2      533557 non-null  int64
 8   Crm_Cd        533557 non-null  int64
 9   Crm_Cd_Desc   533557 non-null  object
 10  Mocodes       533557 non-null  object
 11  Vict_Age      524007 non-null  Int64
 12  Vict_Sex      533557 non-null  category
 13  Vict_Descent  533557 non-null  category
 14  Premis_Cd     533557 non-null  object
 15  Premis_Desc   533557 non-null  object
 16  Status        533557 non-null  object
 17  Status_Desc   533557 non-null  object
 18  Location      533557 non-null  object
 19  Lat           519137 non-null  float64
 20  Lon           531372 non-null  float64
dtypes: Int64(1), category(2), datetime64[ns](2), float64(2), int64(6), object(8)
memory usage: 79.4+ MB
None
Cleans the dataset by dropping rows with null values, coercing Vict_Age to the nullable integer dtype Int64 and Lat/Lon to floats (entries that cannot be parsed become NaN), and converting Vict_Sex and Vict_Descent to memory-efficient categorical dtypes.
df_cleaned = df_cleaned.dropna(subset=['Lat', 'Lon', 'Vict_Age'])
# Verify the cleaned DataFrame
print(df_cleaned.isnull().sum())
print(df_cleaned.info())
DR_NO 0 Date_Rptd 0 Date_Occ 0 Time_Occ 0 Area 0 Area_Name 0 Rpt_Dist_No 0 Part_1_2 0 Crm_Cd 0 Crm_Cd_Desc 0 Mocodes 0 Vict_Age 0 Vict_Sex 0 Vict_Descent 0 Premis_Cd 0 Premis_Desc 0 Status 0 Status_Desc 0 Location 0 Lat 0 Lon 0 dtype: int64 <class 'pandas.core.frame.DataFrame'> Index: 519136 entries, 0 to 533556 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DR_NO 519136 non-null int64 1 Date_Rptd 519136 non-null datetime64[ns] 2 Date_Occ 519136 non-null datetime64[ns] 3 Time_Occ 519136 non-null int64 4 Area 519136 non-null int64 5 Area_Name 519136 non-null object 6 Rpt_Dist_No 519136 non-null int64 7 Part_1_2 519136 non-null int64 8 Crm_Cd 519136 non-null int64 9 Crm_Cd_Desc 519136 non-null object 10 Mocodes 519136 non-null object 11 Vict_Age 519136 non-null Int64 12 Vict_Sex 519136 non-null category 13 Vict_Descent 519136 non-null category 14 Premis_Cd 519136 non-null object 15 Premis_Desc 519136 non-null object 16 Status 519136 non-null object 17 Status_Desc 519136 non-null object 18 Location 519136 non-null object 19 Lat 519136 non-null float64 20 Lon 519136 non-null float64 dtypes: Int64(1), category(2), datetime64[ns](2), float64(2), int64(6), object(8) memory usage: 81.2+ MB None
Removes rows with missing values in the Lat, Lon, and Vict_Age columns from the df_cleaned DataFrame, then verifies the result by printing the per-column missing-value counts and the DataFrame's structure via df_cleaned.info().
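A minimal sketch of the targeted drop, on a small illustrative frame: dropna(subset=...) removes a row only when one of the listed columns is missing, leaving nulls elsewhere untouched.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "Lat": [34.05, np.nan, 33.95],
    "Lon": [-118.26, -118.44, np.nan],
    "Vict_Age": [25, 40, 31],
})

# Only rows with a missing Lat or Lon are dropped.
cleaned = toy.dropna(subset=["Lat", "Lon"])
```

Here only the first row survives, since each of the other two is missing a coordinate.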
EDA¶
Step 1: Exploratory Data Analysis¶
eda_results = {
"Crime Type Frequency": df['Crm_Cd_Desc'].value_counts().head(10),
"Area Crime Count": df['Area_Name'].value_counts(),
"Victim Age Statistics": df['Vict_Age'].describe(),
"Crimes by Time of Day": df['Time_Occ'].value_counts(bins=4).sort_index(),
"Top Premises for Crimes": df['Premis_Desc'].value_counts().head(10)
}
# Prepare for time-series analysis
df['Year_Month'] = df['Date_Occ'].dt.to_period('M')
crimes_by_month = df.groupby('Year_Month').size()
crimes_by_month
Year_Month 2020-01 10044 2020-02 9252 2020-03 8867 2020-04 8772 2020-05 9687 2020-06 9397 2020-07 9301 2020-08 8802 2020-09 8190 2020-10 8889 2020-11 8603 2020-12 8977 2021-01 9794 2021-02 9171 2021-03 9602 2021-04 9298 2021-05 9701 2021-06 9602 2021-07 10328 2021-08 10222 2021-09 10477 2021-10 11161 2021-11 10850 2021-12 10818 2022-01 10484 2022-02 10086 2022-03 11172 2022-04 11214 2022-05 11552 2022-06 11248 2022-07 11110 2022-08 11404 2022-09 10858 2022-10 11361 2022-11 10835 2022-12 11690 2023-01 11935 2023-02 10872 2023-03 11146 2023-04 10899 2023-05 10733 2023-06 10619 2023-07 11287 2023-08 11585 2023-09 11024 2023-10 11633 2023-11 11268 2023-12 11552 2024-01 11937 2024-02 10749 2024-03 10084 2024-04 3415 Freq: M, dtype: int64
Performed EDA by summarizing crime frequencies, victim age statistics, and crime timings while preparing the dataset for time-series analysis by grouping crimes by month.
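The monthly grouping above can be illustrated on a toy frame: dt.to_period('M') truncates each timestamp to its year-month, so groupby(...).size() yields one count per calendar month.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date_Occ": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-11"]),
})

# Truncate each date to its year-month, then count rows per month.
toy["Year_Month"] = toy["Date_Occ"].dt.to_period("M")
crimes_by_month = toy.groupby("Year_Month").size()
```

This produces a count of 2 for 2020-01 and 1 for 2020-02, mirroring the structure of the real crimes_by_month series.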
eda_results
{'Crime Type Frequency': Crm_Cd_Desc
'VEHICLE - STOLEN' 100157
'BURGLARY FROM VEHICLE' 54784
BURGLARY 47246
'THEFT OF IDENTITY' 42839
'THEFT PLAIN - PETTY ($950 & UNDER)' 35574
'THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)' 35050
'THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)' 31752
'SHOPLIFTING - PETTY THEFT ($950 & UNDER)' 23182
ROBBERY 17190
'VANDALISM - MISDEAMEANOR ($399 OR UNDER)' 14468
Name: count, dtype: int64,
'Area Crime Count': Area_Name
Central 33578
Pacific 33206
'77th Street' 31469
Wilshire 27885
'N Hollywood' 27783
Southwest 27020
Newton 26984
'West LA' 26039
Hollywood 25816
Northeast 25662
Southeast 25431
Devonshire 24734
Olympic 23803
'Van Nuys' 23491
'West Valley' 23398
Topanga 23031
Harbor 21714
Mission 21468
Rampart 21413
Hollenbeck 20713
Foothill 18919
Name: count, dtype: int64,
'Victim Age Statistics': count 533557
unique 6741
top 0
freq 168871
Name: Vict_Age, dtype: int64,
'Crimes by Time of Day': (-1.359, 590.5] 80687
(590.5, 1180.0] 109172
(1180.0, 1769.5] 172345
(1769.5, 2359.0] 171353
Name: count, dtype: int64,
'Top Premises for Crimes': Premis_Desc
STREET 178786
'SINGLE FAMILY DWELLING' 96667
'PARKING LOT' 46842
'OTHER BUSINESS' 26576
GARAGE/CARPORT 15602
SIDEWALK 14434
DRIVEWAY 11586
'DEPARTMENT STORE' 10939
'RESTAURANT/FAST FOOD' 6860
'PARKING UNDERGROUND/BUILDING' 6778
Name: count, dtype: int64}
The eda_results dictionary, which contains key insights from the dataset, such as the top 10 crime types, crime counts by area, victim age statistics, crime distributions by time of day, and the top premises for crimes.
Graph 1: Top 10 Crime Types¶
plt.figure(figsize=(10, 10))
df['Crm_Cd_Desc'].value_counts().head(10).plot(kind='bar', title="Top 10 Crime Types")
plt.xlabel('Crime Type')
plt.ylabel('Number of Incidents')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
The bar chart illustrates the top 10 most frequent crime types, with "Vehicle - Stolen" leading significantly, surpassing 100,000 reported incidents. This highlights vehicle theft as a prominent issue in the dataset's coverage area. Following this, crimes like "Burglary from Vehicle", "Burglary", and "Theft of Identity" also show high frequencies, emphasizing a pattern of property-related offenses and vulnerabilities in vehicle and property security.
Petty theft-related crimes, including "Theft Plain - Petty ($950 & Under)", "Theft from Motor Vehicle", and "Shoplifting - Petty Theft", are also prevalent, reflecting opportunistic behaviors targeting easily accessible items. Less frequent but still notable offenses, such as "Robbery" and "Vandalism - Misdemeanor ($399 or Under)", further underscore the dominance of property crimes in the area, suggesting a need for focused preventive measures.
Graph 2: Crime Count by Area¶
plt.figure(figsize=(15, 10))
df['Area_Name'].value_counts().plot(kind='bar', title="Crime Count by Area")
plt.xlabel('Area Name')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
The bar chart visualizes the number of crimes reported in different areas, with the area names on the x-axis and the number of crimes on the y-axis. The findings indicate that the "Central" and "Pacific" areas have the highest crime counts, each with over 33,000 incidents, making them hotspots for criminal activity. They are followed closely by "77th Street", "Wilshire", and "North Hollywood", which also show significantly high crime counts.
Other areas, such as "Southwest", "Newton", and "West LA", report moderately high crime counts, while areas like "Foothill" and "Hollenbeck" have relatively lower counts compared to the leading areas. This distribution suggests that certain regions experience a disproportionate amount of crime, highlighting the need for targeted law enforcement and community safety initiatives in these high-crime areas.
Graph 3: Victim Age Distribution¶
df['Vict_Age'] = pd.to_numeric(df['Vict_Age'], errors='coerce')
df['Vict_Age'] = df['Vict_Age'].fillna(df['Vict_Age'].median())
df['Lat'] = pd.to_numeric(df['Lat'], errors='coerce')
df['Lon'] = pd.to_numeric(df['Lon'], errors='coerce')
df = df.dropna(subset=['Lat', 'Lon'])
df = df[(df['Vict_Age'] > 0) & (df['Vict_Age'] <= 100)]
print(df['Vict_Age'].describe())
df.reset_index(drop=True, inplace=True)
plt.figure(figsize=(10, 6))
df['Vict_Age'].plot(kind='hist', bins=20, title="Victim Age Distribution", color='blue')
plt.xlabel('Victim Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
count 326863.000000 mean 40.862783 std 15.545000 min 2.000000 25% 29.000000 50% 38.000000 75% 51.000000 max 99.000000 Name: Vict_Age, dtype: float64
The histogram visualizes the age distribution of crime victims, with preprocessing steps including converting Vict_Age to numeric values, filling missing ages with the median, and filtering to ages from 1 to 100 (ages recorded as 0, which dominate the raw data, are excluded as likely placeholders). This ensures a clean and accurate representation of the data.
The histogram reveals that most crime victims are between 20 and 40 years old, peaking around 30, indicating young adults are the most affected group. Victim frequency declines steadily beyond 40 and drops significantly after 60, suggesting lower victimization rates among older individuals. These findings underscore the need for targeted safety measures for young adults, who are at a higher risk of crime.
Graph 4: Crimes by Time of Day¶
time_bins = [0, 600, 1200, 1800, 2400]
time_labels = ['Midnight to Morning', 'Morning to Noon', 'Noon to Evening', 'Evening to Midnight']
df['Time_Binned'] = pd.cut(df['Time_Occ'], bins=time_bins, labels=time_labels, right=False)
# Plot the cleaned Time of Day distribution
plt.figure(figsize=(10, 6))
df['Time_Binned'].value_counts().sort_index().plot(kind='bar', title="Crimes by Time of Day (Cleaned)")
plt.xlabel('Time of Day')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
The bar chart visualizes the distribution of crimes across four time intervals: Midnight to Morning, Morning to Noon, Noon to Evening, and Evening to Midnight. The data reveals that most crimes occur Noon to Evening, followed by Evening to Midnight, indicating a higher crime rate during the latter part of the day.
In contrast, fewer crimes are reported Morning to Noon, with the lowest frequency occurring Midnight to Morning. This trend suggests that criminal activity peaks in the afternoon and evening hours, tapering off during the early morning, potentially reflecting variations in daily routines and societal activity levels.
Graph 5: Crimes by Year and Month¶
# Remove the last data point (potentially incomplete month/year) from the time series
filtered_crimes_by_month = crimes_by_month.iloc[:-1]
# Plot the filtered Crimes by Year and Month
plt.figure(figsize=(12, 6))
filtered_crimes_by_month.plot(kind='line', title="Crimes by Year and Month (Filtered)")
plt.xlabel('Year-Month')
plt.ylabel('Number of Crimes')
plt.grid()
plt.tight_layout()
plt.show()
The line chart visualizes the monthly trend of reported crimes from January 2020 through March 2024, with the final (incomplete) month excluded. Counts climb from a 2020 low of roughly 8,000-10,000 per month to around 11,000 by mid-2022, then hold roughly steady through 2023 and early 2024, punctuated by occasional dips and spikes.
These fluctuations hint at potential seasonality or other external factors influencing crime rates, while the sustained elevation relative to 2020 underscores the importance of continued efforts to address and mitigate criminal activity in the region.
# Create a sparse matrix for area and crime type
area_crime_matrix = pd.crosstab(df['Area_Name'], df['Crm_Cd_Desc'])
sparse_matrix = csr_matrix(area_crime_matrix.values)
# Calculate additional metrics
metrics = {
"Total Records": len(df),
"Total Unique Crime Types": df['Crm_Cd_Desc'].nunique(),
"Total Unique Areas": df['Area_Name'].nunique(),
"Missing Values": df.isnull().sum().sum(),
"Density of Sparse Matrix": (sparse_matrix.nnz / np.prod(sparse_matrix.shape)),
}
# Sparse Matrix Dimensions
sparse_matrix_shape = sparse_matrix.shape
metrics_output = {
"Total Records": metrics["Total Records"],
"Total Unique Crime Types": metrics["Total Unique Crime Types"],
"Total Unique Areas": metrics["Total Unique Areas"],
"Missing Values": metrics["Missing Values"],
"Density of Sparse Matrix": metrics["Density of Sparse Matrix"],
"Sparse Matrix Shape": sparse_matrix_shape,
}
This code creates a sparse matrix to analyze the relationship between areas and crime types, calculates metrics, and outputs key dataset statistics:
Sparse Matrix Creation:
- A crosstabulation is created with pd.crosstab, mapping Area_Name (rows) to Crm_Cd_Desc (columns), so each cell holds the frequency of that crime type in that area.
- The resulting matrix is converted into a sparse matrix using csr_matrix for efficient storage.
Metrics Calculation:
- Total Records: The number of rows in the dataset.
- Total Unique Crime Types: The number of distinct crime types.
- Total Unique Areas: The number of unique areas.
- Missing Values: The total number of missing values in the dataset.
- Density of Sparse Matrix: The ratio of non-zero elements to the total elements in the sparse matrix, indicating how "dense" the matrix is.
Output:
- The calculated metrics and the dimensions of the sparse matrix, for further analysis or reporting.
This step summarizes the dataset's structure and provides a compressed representation of the area-crime relationships, useful for efficient data manipulation and machine learning applications.
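The density metric can be checked on a toy crosstab (the names below are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy area-by-crime table: Central has two crime types, Harbor one.
toy = pd.DataFrame({
    "Area_Name": ["Central", "Central", "Harbor"],
    "Crm_Cd_Desc": ["ROBBERY", "BURGLARY", "ROBBERY"],
})
matrix = pd.crosstab(toy["Area_Name"], toy["Crm_Cd_Desc"])
sparse = csr_matrix(matrix.values)

# Density = stored non-zero cells / total cells; 3 of the 4 cells here are non-zero.
density = sparse.nnz / np.prod(sparse.shape)
```

A density of about 0.74 for the real matrix therefore means roughly three quarters of all area-crime combinations occur at least once.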
area_crime_matrix
| Crm_Cd_Desc | 'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER' | 'ATTEMPTED ROBBERY' | 'BATTERY - SIMPLE ASSAULT' | 'BATTERY ON A FIREFIGHTER' | 'BATTERY POLICE (SIMPLE)' | 'BATTERY WITH SEXUAL CONTACT' | 'BIKE - ATTEMPTED STOLEN' | 'BIKE - STOLEN' | 'BLOCKING DOOR INDUCTION CENTER' | 'BOMB SCARE' | ... | COUNTERFEIT | EXTORTION | KIDNAPPING | PANDERING | PICKPOCKET | PIMPING | PROWLER | ROBBERY | STALKING | TRESPASSING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Area_Name | |||||||||||||||||||||
| '77th Street' | 8 | 237 | 319 | 0 | 5 | 2 | 0 | 24 | 0 | 0 | ... | 0 | 76 | 11 | 3 | 13 | 42 | 4 | 1579 | 31 | 159 |
| 'N Hollywood' | 2 | 71 | 248 | 1 | 0 | 3 | 0 | 322 | 0 | 0 | ... | 1 | 72 | 7 | 0 | 31 | 0 | 7 | 416 | 8 | 391 |
| 'Van Nuys' | 1 | 62 | 189 | 1 | 2 | 2 | 0 | 203 | 0 | 11 | ... | 4 | 62 | 6 | 8 | 22 | 8 | 1 | 417 | 14 | 322 |
| 'West LA' | 0 | 46 | 249 | 1 | 1 | 4 | 1 | 831 | 0 | 1 | ... | 2 | 58 | 1 | 1 | 89 | 4 | 13 | 212 | 12 | 460 |
| 'West Valley' | 1 | 67 | 187 | 0 | 3 | 3 | 0 | 164 | 4 | 2 | ... | 2 | 92 | 7 | 0 | 25 | 1 | 61 | 367 | 9 | 415 |
| Central | 41 | 205 | 807 | 7 | 13 | 5 | 0 | 673 | 0 | 17 | ... | 0 | 22 | 15 | 0 | 599 | 3 | 0 | 1139 | 11 | 299 |
| Devonshire | 1 | 60 | 192 | 2 | 2 | 2 | 0 | 112 | 0 | 0 | ... | 0 | 89 | 7 | 0 | 37 | 1 | 7 | 249 | 14 | 371 |
| Foothill | 2 | 58 | 173 | 0 | 3 | 2 | 0 | 38 | 0 | 5 | ... | 4 | 86 | 16 | 0 | 5 | 0 | 0 | 333 | 14 | 218 |
| Harbor | 0 | 76 | 183 | 0 | 0 | 0 | 0 | 89 | 0 | 1 | ... | 1 | 59 | 7 | 0 | 19 | 0 | 3 | 342 | 11 | 185 |
| Hollenbeck | 12 | 79 | 212 | 1 | 2 | 4 | 0 | 48 | 0 | 1 | ... | 0 | 61 | 7 | 0 | 25 | 0 | 0 | 465 | 7 | 124 |
| Hollywood | 7 | 111 | 397 | 1 | 7 | 18 | 0 | 318 | 0 | 3 | ... | 1 | 55 | 17 | 3 | 585 | 15 | 1 | 758 | 38 | 413 |
| Mission | 11 | 92 | 140 | 0 | 3 | 1 | 0 | 79 | 0 | 8 | ... | 4 | 107 | 8 | 0 | 8 | 0 | 0 | 439 | 24 | 269 |
| Newton | 3 | 186 | 370 | 3 | 0 | 1 | 0 | 55 | 0 | 0 | ... | 0 | 62 | 10 | 0 | 120 | 0 | 0 | 1098 | 8 | 109 |
| Northeast | 2 | 87 | 221 | 1 | 6 | 2 | 2 | 256 | 0 | 1 | ... | 2 | 74 | 3 | 0 | 166 | 0 | 9 | 360 | 26 | 298 |
| Olympic | 5 | 139 | 352 | 0 | 1 | 7 | 0 | 250 | 0 | 2 | ... | 2 | 48 | 12 | 1 | 190 | 10 | 2 | 701 | 17 | 146 |
| Pacific | 13 | 70 | 295 | 2 | 2 | 12 | 1 | 1124 | 0 | 17 | ... | 0 | 62 | 3 | 1 | 102 | 6 | 16 | 364 | 9 | 289 |
| Rampart | 2 | 177 | 365 | 0 | 2 | 2 | 0 | 182 | 0 | 1 | ... | 1 | 22 | 10 | 0 | 111 | 2 | 1 | 789 | 11 | 139 |
| Southeast | 8 | 170 | 229 | 1 | 0 | 0 | 0 | 22 | 0 | 2 | ... | 1 | 74 | 20 | 7 | 7 | 9 | 0 | 1140 | 35 | 155 |
| Southwest | 47 | 147 | 376 | 3 | 15 | 2 | 0 | 740 | 1 | 13 | ... | 3 | 101 | 5 | 2 | 255 | 0 | 10 | 927 | 32 | 413 |
| Topanga | 25 | 54 | 201 | 0 | 2 | 1 | 0 | 111 | 0 | 9 | ... | 13 | 89 | 2 | 0 | 39 | 0 | 8 | 416 | 13 | 422 |
| Wilshire | 19 | 96 | 254 | 1 | 11 | 2 | 0 | 289 | 0 | 9 | ... | 7 | 57 | 11 | 0 | 212 | 0 | 4 | 673 | 28 | 560 |
21 rows × 107 columns
The area_crime_matrix presents a crosstabulation of Area_Name (rows) and Crm_Cd_Desc (columns), detailing the frequency of each crime type in different areas. Each cell indicates how often a specific crime occurred in a given area, offering a granular view of crime distribution.
Key findings reveal that areas like Central, Wilshire, and Pacific exhibit higher counts across multiple crime types, marking them as crime hotspots. Conversely, certain areas report low or zero occurrences for specific crimes, highlighting regional variations in crime patterns. This matrix serves as a valuable tool for targeted interventions and area-specific crime analysis.
metrics
{'Total Records': 326863,
'Total Unique Crime Types': 107,
'Total Unique Areas': 21,
'Missing Values': 0,
'Density of Sparse Matrix': 0.7427681352914998}
The metrics output summarizes the dataset with 326,863 records, 107 crime types, 21 areas, no missing values, and a sparse matrix density of 74%.
sparse_matrix_shape
(21, 107)
The sparse_matrix_shape output shows that the sparse matrix has 21 rows (areas) and 107 columns (crime types).
metrics_output
{'Total Records': 326863,
'Total Unique Crime Types': 107,
'Total Unique Areas': 21,
'Missing Values': 0,
'Density of Sparse Matrix': 0.7427681352914998,
'Sparse Matrix Shape': (21, 107)}
The metrics_output summarizes the dataset with 326,863 records, 107 crime types, 21 areas, no missing values, a sparse matrix density of 74.28%, and dimensions of (21, 107).
# Removing extra quotes if any
df['Area_Name'] = df['Area_Name'].str.replace("'", "")
df['Crm_Cd_Desc'] = df['Crm_Cd_Desc'].str.replace("'", "")
# Create a sparse matrix (Area vs. Crime Type)
area_crime_matrix = pd.crosstab(df['Area_Name'], df['Crm_Cd_Desc'])
sparse_matrix = csr_matrix(area_crime_matrix.values)
# Plot the sparse matrix as a heatmap
plt.figure(figsize=(12, 8))
plt.imshow(area_crime_matrix.values, cmap="YlGnBu", aspect="auto")
plt.colorbar(label="Crime Count")
plt.xticks(range(area_crime_matrix.columns.size), area_crime_matrix.columns, rotation=90, fontsize=8)
plt.yticks(range(area_crime_matrix.index.size), area_crime_matrix.index, fontsize=10)
plt.title("Area vs Crime Type (Heatmap)", fontsize=14)
plt.xlabel("Crime Type", fontsize=12)
plt.ylabel("Area", fontsize=12)
plt.tight_layout()
plt.show()
This heatmap visualizes the relationship between areas and crime types, with preprocessing steps including the removal of extra quotes from Area_Name and Crm_Cd_Desc and the creation of a sparse matrix where rows represent areas, columns represent crime types, and values indicate crime counts.
The heatmap uses the YlGnBu color scheme, with darker shades signifying higher crime counts. It highlights areas like Central and Wilshire, which show higher activity across multiple crime types. Most crimes are sparsely distributed, with a few types dominating specific areas. This visualization effectively identifies patterns and hotspots, aiding targeted analysis and intervention strategies.
# Identify the top 10 crime types
top_10_crime_types = df['Crm_Cd_Desc'].value_counts().head(10).index
# Filter the area-crime matrix for the top 10 crime types
filtered_area_crime_matrix = area_crime_matrix[top_10_crime_types]
# Plot the filtered matrix as a heatmap
plt.figure(figsize=(12, 8))
plt.imshow(filtered_area_crime_matrix.values, cmap="YlGnBu", aspect="auto")
plt.colorbar(label="Crime Count")
plt.xticks(range(filtered_area_crime_matrix.columns.size), filtered_area_crime_matrix.columns, rotation=45, ha="right")
plt.yticks(range(filtered_area_crime_matrix.index.size), filtered_area_crime_matrix.index)
plt.title("Top 10 Crime Types by Area (Heatmap)", fontsize=14)
plt.xlabel("Crime Type", fontsize=12)
plt.ylabel("Area", fontsize=12)
plt.tight_layout()
plt.show()
This heatmap visualizes the distribution of the top 10 most frequent crime types across different areas, focusing on high-frequency crimes. The data was filtered to include only the top 10 crime types, creating a focused representation of key patterns. The x-axis represents these crime types, while the y-axis represents various areas, with darker shades in the YlGnBu color scheme indicating higher crime counts.
Key insights reveal that areas like Central, Wilshire, and 77th Street exhibit heightened activity across multiple crime types, particularly Burglary from Vehicle and Theft of Identity. In contrast, crimes such as Robbery and Vandalism appear more localized to specific areas. This visualization highlights crime hotspots for the most common offenses, offering valuable insights for targeted prevention and intervention strategies.
4. Model¶
Analysis based on Hypothesis¶
Relationship Between Crime Type and Area¶
- Hypothesis: Specific crime types are concentrated in certain areas. For instance, vehicle-related crimes might be more common in high traffic or urban areas.
- Reasoning: The heatmap suggests certain crime types have hotspots in specific areas.
After cleaning, the dataset contains 23 columns and 326,863 rows (down from the original 533,557), including the following key attributes relevant to the hypothesis:
- Crm_Cd_Desc: Describes the type of crime.
- Area_Name: Provides the name of the area where the crime occurred.
- Lat and Lon: Coordinates for geographical analysis.
- Premis_Desc: Description of the location of the crime.
- Date_Occ and Time_Occ: Provide date and time of occurrence.
To explore the relationship between crime types and areas, we will focus on Crm_Cd_Desc and Area_Name and analyze their distribution. We will also visualize potential hotspots using heatmaps or similar methods.
Let’s start by examining the most frequent crime types per area.
# Grouping data by Area_Name and Crm_Cd_Desc to find the most common crimes in each area
crime_area_group = (
df.groupby(['Area_Name', 'Crm_Cd_Desc'])
.size()
.reset_index(name='Count')
)
# Finding the most frequent crime type per area
most_frequent_crimes_per_area = (
crime_area_group.loc[crime_area_group.groupby('Area_Name')['Count'].idxmax()]
.sort_values(by='Count', ascending=False)
)
# Display the results
most_frequent_crimes_per_area
| Area_Name | Crm_Cd_Desc | Count | |
|---|---|---|---|
| 94 | Central | BURGLARY FROM VEHICLE | 8117 |
| 480 | Hollywood | BURGLARY FROM VEHICLE | 3707 |
| 69 | 77th Street | THEFT OF IDENTITY | 3449 |
| 950 | Pacific | BURGLARY FROM VEHICLE | 3147 |
| 1162 | Southeast | THEFT OF IDENTITY | 3064 |
| 1439 | West LA | BURGLARY FROM VEHICLE | 3004 |
| 637 | N Hollywood | BURGLARY FROM VEHICLE | 2968 |
| 870 | Olympic | BURGLARY FROM VEHICLE | 2852 |
| 1600 | Wilshire | BURGLARY FROM VEHICLE | 2807 |
| 794 | Northeast | BURGLARY FROM VEHICLE | 2672 |
| 227 | Devonshire | THEFT OF IDENTITY | 2543 |
| 1354 | Van Nuys | BURGLARY FROM VEHICLE | 2468 |
| 1271 | Topanga | BURGLARY | 2457 |
| 712 | Newton | BURGLARY FROM VEHICLE | 2408 |
| 1519 | West Valley | BURGLARY FROM VEHICLE | 2319 |
| 1246 | Southwest | THEFT OF IDENTITY | 2291 |
| 1036 | Rampart | BURGLARY FROM VEHICLE | 2211 |
| 305 | Foothill | THEFT OF IDENTITY | 2096 |
| 615 | Mission | THEFT OF IDENTITY | 2032 |
| 454 | Hollenbeck | THEFT OF IDENTITY | 1692 |
| 375 | Harbor | THEFT OF IDENTITY | 1353 |
Identifies the most frequent crime type in each area by grouping the dataset by Area_Name and Crm_Cd_Desc, calculating counts, and filtering for the most common crime per area. The analysis highlights distinct crime patterns across regions.
Findings:¶
- "Burglary from Vehicle" is most common in areas like Central, Hollywood, and Pacific, with Central reporting the highest count (8,117 incidents).
- "Theft of Identity" dominates areas such as Southeast, West LA, and Devonshire.
- Topanga reports "Burglary" as the most frequent crime, showing regional variation.
1. Overall Crime Type Distribution¶
# Plot the overall crime type distribution
crime_type_counts = df['Crm_Cd_Desc'].value_counts().head(10)
# Retry plotting the overall crime type distribution
crime_type_counts.plot(kind='bar')
plt.title('Top 10 Crime Types')
plt.ylabel('Count')
plt.xlabel('Crime Type')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
The bar chart visualizes the top 10 most frequent crime types in the age-filtered dataset, emphasizing the dominance of property-related offenses, particularly those involving vehicles and theft. Note that the earlier age filter dropped records with a victim age of 0, which removes most victimless reports such as "Vehicle - Stolen", so this ranking differs from Graph 1.
Findings:¶
"Burglary from Vehicle" leads significantly with over 40,000 incidents, making it the most common crime.
Crimes like "Theft of Identity", "Burglary", and "Theft from Motor Vehicle - Grand ($950.01 and Over)" also show high prevalence.
Less frequent crimes, including "Robbery", "Vandalism - Misdemeanor ($399 or Under)", and "Brandish Weapon", still feature prominently in the dataset.
2. Top Crime Types by Area¶
# Group data by Area_Name and Crm_Cd_Desc to find the most common crimes in each area
crime_area_group = (
df.groupby(['Area_Name', 'Crm_Cd_Desc'])
.size()
.reset_index(name='Count')
)
# Find the most frequent crime type per area
most_frequent_crimes_per_area = (
crime_area_group.loc[crime_area_group.groupby('Area_Name')['Count'].idxmax()]
.sort_values(by='Count', ascending=False)
)
# Get the top 10 areas with the highest count of a specific crime type
top_crimes_by_area = most_frequent_crimes_per_area.head(10)
plt.barh(top_crimes_by_area['Area_Name'], top_crimes_by_area['Count'])
plt.xlabel('Count')
plt.ylabel('Area Name')
plt.title('Top Crime Types by Area')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# Display the results for analysis
most_frequent_crimes_per_area.head(10)
| Area_Name | Crm_Cd_Desc | Count | |
|---|---|---|---|
| 94 | Central | BURGLARY FROM VEHICLE | 8117 |
| 480 | Hollywood | BURGLARY FROM VEHICLE | 3707 |
| 69 | 77th Street | THEFT OF IDENTITY | 3449 |
| 950 | Pacific | BURGLARY FROM VEHICLE | 3147 |
| 1162 | Southeast | THEFT OF IDENTITY | 3064 |
| 1439 | West LA | BURGLARY FROM VEHICLE | 3004 |
| 637 | N Hollywood | BURGLARY FROM VEHICLE | 2968 |
| 870 | Olympic | BURGLARY FROM VEHICLE | 2852 |
| 1600 | Wilshire | BURGLARY FROM VEHICLE | 2807 |
| 794 | Northeast | BURGLARY FROM VEHICLE | 2672 |
The table highlights the most frequent crime types in each area, revealing distinct patterns of geographic concentration and dominance of certain offenses.
Findings:¶
- "Burglary from Vehicle" is the leading crime in areas like Central (8,117 incidents), Hollywood, and Pacific.
- "Theft of Identity" is most common in areas such as 77th Street, Southeast, and Devonshire.
- West LA and North Hollywood also report high occurrences of "Burglary from Vehicle", underscoring its prevalence.
3. Temporal Analysis: Analyze crime trends over time¶
# Remove the last month in the dataset for temporal analysis
df['Date_Occ'] = pd.to_datetime(df['Date_Occ'], errors='coerce')
latest_month = df['Date_Occ'].max().month
latest_year = df['Date_Occ'].max().year
# Filter out the last month and create a copy to avoid warnings
filtered_data = df[
~((df['Date_Occ'].dt.month == latest_month) & (df['Date_Occ'].dt.year == latest_year))
].copy() # Use .copy() here to ensure it's a new DataFrame
# Extract year and month for temporal analysis
filtered_data['Year'] = filtered_data['Date_Occ'].dt.year
filtered_data['Month'] = filtered_data['Date_Occ'].dt.month
# Group data by Year and Month for crime trends
temporal_trends_filtered = (
filtered_data.groupby(['Year', 'Month'])
.size()
.reset_index(name='Crime_Count')
.sort_values(by=['Year', 'Month'])
)
# Plotting the temporal trends without the last month
plt.figure(figsize=(14, 8))
plt.plot(
temporal_trends_filtered['Year'].astype(str) + '-' + temporal_trends_filtered['Month'].astype(str),
temporal_trends_filtered['Crime_Count'],
marker='o'
)
plt.title('Crime Trends Over Time')
plt.xlabel('Time (Year-Month)')
plt.ylabel('Number of Crimes')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
This analysis examines temporal trends in crime by grouping incidents by year and month, excluding the latest incomplete month to ensure accurate insights.
Findings:¶
- Crime counts generally increase from 2020 through 2022, then level off with modest fluctuations through early 2024.
- Fluctuations in the trends suggest possible seasonal or external factors influencing criminal activity.
These findings highlight temporal patterns, aiding in better resource allocation and intervention planning.
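One quick way to probe the suspected seasonality, sketched here on hypothetical counts, is to average the monthly totals by calendar month across years:

```python
import pandas as pd

# Hypothetical monthly totals for two Januaries and two Julys.
idx = pd.PeriodIndex(["2020-01", "2020-07", "2021-01", "2021-07"], freq="M")
counts = pd.Series([100, 140, 110, 150], index=idx)

# Average across years for each calendar month; a stable gap between
# months would suggest a seasonal component.
by_calendar_month = counts.groupby(idx.month).mean()
```

Applied to crimes_by_month from the real data, the same grouping would reveal whether, say, summer months consistently run hotter than winter months.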
4. Premises Analysis: Study relationships between crime types and locations.¶
# Group data by Premis_Desc and Crm_Cd_Desc to find the most common crime types at each location type
premises_crime_group = (
df.groupby(['Premis_Desc', 'Crm_Cd_Desc'])
.size()
.reset_index(name='Count')
.sort_values(by='Count', ascending=False)
)
# Get the top 10 premises with the most frequent crimes
top_premises_crimes = premises_crime_group.head(10)
# Plot the top premises for crimes
plt.barh(top_premises_crimes['Premis_Desc'], top_premises_crimes['Count'])
plt.xlabel('Number of Crimes')
plt.ylabel('Premises Description')
plt.title('Top Premises for Crimes')
plt.gca().invert_yaxis() # Invert y-axis for better readability
plt.tight_layout()
plt.show()
This analysis identifies the most frequent premises-and-crime-type combinations by grouping on Premis_Desc and Crm_Cd_Desc, then plotting the top 10 pairs. Because counts are per combination rather than per premises overall, the ranking can differ from the raw premises totals shown earlier, where Street led.
Key Findings:¶
- Single Family Dwelling: site of the single most frequent premises-crime combination, with over 25,000 incidents, emphasizing residential areas as significant crime sites.
- Street: the second most common location among the top combinations, highlighting public spaces as key areas of concern.
- Parking Lot: the third most frequent site, pointing to potential security issues in these areas.
These findings underscore the need for targeted safety measures in both residential and public spaces to address crime effectively.
df.head()
| DR_NO | Date_Rptd | Date_Occ | Time_Occ | Area | Area_Name | Rpt_Dist_No | Part_1_2 | Crm_Cd | Crm_Cd_Desc | ... | Vict_Descent | Premis_Cd | Premis_Desc | Status | Status_Desc | Location | Lat | Lon | Year_Month | Time_Binned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200106753 | 2020-02-09 | 2020-02-08 | 1800 | 1 | Central | 182 | 1 | 330 | BURGLARY FROM VEHICLE | ... | O | 128.0 | 'BUS STOP/LAYOVER (ALSO QUERY 124)' | IC | 'Invest Cont' | '1000 S FLOWER ST' | 34.0444 | -118.2628 | 2020-02 | Evening to Midnight |
| 1 | 200907217 | 2023-05-10 | 2020-03-10 | 2037 | 9 | Van Nuys | 964 | 1 | 343 | SHOPLIFTING-GRAND THEFT ($950.01 & OVER) | ... | O | 405.0 | 'CLOTHING STORE' | IC | 'Invest Cont' | '14000 RIVERSIDE DR' | 34.1576 | -118.4387 | 2020-03 | Evening to Midnight |
| 2 | 220614831 | 2022-08-18 | 2020-08-17 | 1200 | 6 | Hollywood | 666 | 2 | 354 | THEFT OF IDENTITY | ... | H | 102.0 | SIDEWALK | IC | 'Invest Cont' | '1900 TRANSIENT' | 34.0944 | -118.3277 | 2020-08 | Noon to Evening |
| 3 | 231808869 | 2023-04-04 | 2020-12-01 | 2300 | 18 | Southeast | 1826 | 2 | 354 | THEFT OF IDENTITY | ... | H | 501.0 | 'SINGLE FAMILY DWELLING' | IC | 'Invest Cont' | '9900 COMPTON AV' | 33.9467 | -118.2463 | 2020-12 | Evening to Midnight |
| 4 | 220314085 | 2022-07-22 | 2020-05-12 | 1110 | 3 | Southwest | 303 | 2 | 354 | THEFT OF IDENTITY | ... | B | 248.0 | 'CELL PHONE STORE' | IC | 'Invest Cont' | '2500 S SYCAMORE AV' | 34.0335 | -118.3537 | 2020-05 | Morning to Noon |
5 rows × 23 columns
# Create a geometry column from LAT/LON coordinates
geometry = [Point(lon, lat) for lon, lat in zip(df_cleaned['Lon'], df_cleaned['Lat'])]
# Create a GeoDataFrame
gdf = gpd.GeoDataFrame(df_cleaned, geometry=geometry)
# Set the coordinate reference system (CRS) to WGS84
gdf.set_crs(epsg=4326, inplace=True)
# Display the first few rows of the GeoDataFrame
gdf.head()
| | DR_NO | Date_Rptd | Date_Occ | Time_Occ | Area | Area_Name | Rpt_Dist_No | Part_1_2 | Crm_Cd | Crm_Cd_Desc | ... | Vict_Sex | Vict_Descent | Premis_Cd | Premis_Desc | Status | Status_Desc | Location | Lat | Lon | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200106753 | 2020-02-09 | 2020-02-08 | 1800 | 1 | Central | 182 | 1 | 330 | BURGLARY FROM VEHICLE | ... | M | O | 128.0 | BUS STOP/LAYOVER (ALSO QUERY 124) | IC | Invest Cont | 1000 S FLOWER ST | 34.0444 | -118.2628 | POINT (-118.2628 34.0444) |
| 1 | 200907217 | 2023-05-10 | 2020-03-10 | 2037 | 9 | Van Nuys | 964 | 1 | 343 | SHOPLIFTING-GRAND THEFT ($950.01 & OVER) | ... | M | O | 405.0 | CLOTHING STORE | IC | Invest Cont | 14000 RIVERSIDE DR | 34.1576 | -118.4387 | POINT (-118.4387 34.1576) |
| 2 | 220614831 | 2022-08-18 | 2020-08-17 | 1200 | 6 | Hollywood | 666 | 2 | 354 | THEFT OF IDENTITY | ... | M | H | 102.0 | SIDEWALK | IC | Invest Cont | 1900 TRANSIENT | 34.0944 | -118.3277 | POINT (-118.3277 34.0944) |
| 3 | 231808869 | 2023-04-04 | 2020-12-01 | 2300 | 18 | Southeast | 1826 | 2 | 354 | THEFT OF IDENTITY | ... | M | H | 501.0 | SINGLE FAMILY DWELLING | IC | Invest Cont | 9900 COMPTON AV | 33.9467 | -118.2463 | POINT (-118.2463 33.9467) |
| 4 | 220314085 | 2022-07-22 | 2020-05-12 | 1110 | 3 | Southwest | 303 | 2 | 354 | THEFT OF IDENTITY | ... | F | B | 248.0 | CELL PHONE STORE | IC | Invest Cont | 2500 S SYCAMORE AV | 34.0335 | -118.3537 | POINT (-118.3537 34.0335) |
5 rows × 22 columns
Converts the cleaned dataset into a geospatial format for mapping and spatial analysis.
Steps:¶
- Create Geometry Column: Combines the latitude (`Lat`) and longitude (`Lon`) coordinates into `Point` objects for each record using the `shapely.geometry.Point` class.
- Create GeoDataFrame: Converts the `df_cleaned` DataFrame into a GeoDataFrame (`gdf`) using `geopandas.GeoDataFrame`, incorporating the geometry column.
- Set Coordinate Reference System (CRS): Sets the CRS to WGS84 (EPSG:4326), a standard for geographic coordinates, enabling accurate mapping and geospatial analysis.
- Preview GeoDataFrame: Displays the first 5 rows of the GeoDataFrame, which now includes a `geometry` column for spatial representation.
Purpose:¶
This prepares the dataset for geospatial analysis, allowing crimes to be visualized on maps and enabling spatial queries to identify trends or hotspots.
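As a quick illustration of such a spatial query, GeoPandas provides the `.cx` coordinate indexer, which slices a GeoDataFrame to the rows whose geometry falls inside a longitude/latitude bounding box. The sketch below builds a small stand-in GeoDataFrame (`demo` is hypothetical; a real run would use `gdf` from above) and selects the points near downtown:

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical mini-dataset standing in for gdf; column names mirror the tutorial's
demo = gpd.GeoDataFrame(
    {"Area_Name": ["Central", "Hollywood", "Southeast"],
     "Lat": [34.0444, 34.0944, 33.9467],
     "Lon": [-118.2628, -118.3277, -118.2463]},
    geometry=[Point(-118.2628, 34.0444),
              Point(-118.3277, 34.0944),
              Point(-118.2463, 33.9467)],
    crs="EPSG:4326",
)

# .cx[xmin:xmax, ymin:ymax] keeps rows whose geometry lies in the box
# (x = longitude, y = latitude); this box roughly covers downtown LA
downtown = demo.cx[-118.30:-118.20, 34.00:34.10]
print(list(downtown["Area_Name"]))  # only the Central point falls inside
```

The same pattern scales to the full dataset, e.g. counting crimes inside a neighborhood's bounding box before drilling into premise types or times of day.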
# Imports for interactive mapping
import folium
from folium.plugins import HeatMap

# Center the map on the mean latitude and longitude of the crime locations
map_center = [df_cleaned['Lat'].mean(), df_cleaned['Lon'].mean()]
# Prepare the [lat, lon] pairs the HeatMap layer expects
heat_data = df_cleaned[['Lat', 'Lon']].values.tolist()
# Create the base map and overlay the heat layer
heatmap = folium.Map(location=map_center, zoom_start=12)
HeatMap(heat_data).add_to(heatmap)
# Display the heatmap
heatmap